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Abstract: In the context of clustering, we assume a generative model 
where each cluster is the result of sampling points in the neighborhood of an 
embedded smooth surface; the sample may be contaminated with outliers, 
which are modeled as points sampled in space away from the clusters. We 
consider a prototype for a higher-order spectral clustering method based 
on the residual from a local linear approximation. We obtain theoretical 
guarantees for this algorithm and show that, in terms of both separation 
and robustness to outliers, it outperforms the standard spectral clustering 
algorithm (based on pairwise distances) of Ng, Jordan and Weiss (NIPS 
'01). The optimal choice for some of the tuning parameters depends on the 
dimension and thickness of the clusters. We provide estimators that come 
close enough for our theoretical purposes. We also discuss the cases of clus- 
ters of mixed dimensions and of clusters that are generated from smoother 
surfaces. In our experiments, this algorithm is shown to outperform pairwise 
spectral clustering on both simulated and real data. 
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1. Introduction 

In a number of modern applications, tlie data appear to cluster near some low- 
dimensional structures. In the particular setting of manifold learning [51, 47, 7, 
22, 17], the data are assumed to lie near manifolds embedded in Euclidean space. 
When multiple manifolds are present, the foremost task is separating them, 
meaning the recovery of the different components of the data associated with 
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the different manifolds. Manifold clustering naturally occurs in the human visual 
cortex, which excels at grouping points into clusters of various shapes [41, 21]. 
It is also relevant for a number of modern applications. For example, in cos- 
mology, galaxies seem to cluster forming various geometric structures such as 
one-dimensional filaments and two-dimensional walls [52, 39]. In motion segmen- 
tation, feature vectors extracted from moving objects and tracked along different 
views cluster along affine or algebraic surfaces [35, 23, 53, 10]. In face recognition, 
images of faces in fixed pose under varying illumination conditions cluster near 
low-dimensional affine subspaces [31, 6, 19], or along low-dimensional manifolds 
when introducing additional poses and camera views. 

In the last few years several algorithms for multi-manifold clustering were 
introduced; we discuss them individually in Section 1.3.3. We focus here on 
spectral clustering methods, and in particular, study a prototypical multiway 
method relying on local linear approximations, with precursors appearing in [12, 
1, 48, 2, 27]. We refer to this method as Higher-Order Spectral Clustering 
(HOSC). We establish theoretical guarantees for this method within a standard 
mathematical framework for multi-manifold clustering. Compared with all other 
algorithms we are aware of, HOSC is able to separate clusters that are much 
closer together; equivalently, HOSC is accurate under much lower sampling rate 
than any other algorithm we know of. Roughly speaking, a typical algorithm for 
multi-manifold clustering relies on local characteristics of the point cloud in a 
way that presupposes that all points, or at least the vast majority of the points, 
in a (small enough) neighborhood are from a single cluster, except in places like 
intersections of clusters. In contrast, though HOSC is also a local method, it 
can work with neighborhoods where two or more clusters coexist. 

1.1. Higher- Order Spectral Clustering (HOSC) 

We introduce our higher-order spectral clustering algorithm in this section, trac- 
ing its origins to the spectral clustering algorithm of Ng et al. [42] and the 
spectral curvature clustering of Chen and Lerman [12, 11]. 

Spectral methods are based on building a neighborhood graph on the data 
points and partitioning the graph using its Laplacian [22, 34], which is closely 
related to the extraction of connected components. The version introduced by 
Ng et al. [42] is an emblematic example — we refer to this approach as SC. It 
uses an affinity based on pairwise distances. Given a scale parameter e > and 
a kernel 0, define 

1^ U, Xi — X2. 

01 • ]| denotes the Euclidean norm.) Standard choices include the heat kernel 
^(s) = exp(— s^) and the simple kernel (j){s) = l{|s|<i}. Let Xi, . . . ,xiy S 
denote the data points. SC starts by computing all pairwise affinities W = 
(Wij), with Wij = a(xi,Xj), for i,j = 1, . . . ,iV. It then computes the matrix 
Z = (Zij) : Zij = W^j/{DiDJy/'^, where A = J2i<j<N^ij is the degree of 
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the ith point in the graph with similarity matrix W. Note that I — Z is the 
corresponding normahzed Laplacian. Providing the algorithm with the number 
of clusters K, SC continues by extracting the top K eigenvectors of Z, obtaining 
a matrix U G M.^^^ , and after normalizing its rows, uses them to embed the 
data into K^. The algorithm concludes by applying means to the embedded 
points. See Algorithm 1 for a summary. 

Algorithm 1 Spectral Clustering (SC) [42] 

Input: 

xi, X2 , Xjv: the data points 
e: the affinity scale 
K: the number of clusters 
Output: 

A partition of the data into K disjoint clusters 
Steps: 

1: Compute the affinity matrix W = (Wij), with Wij = «(xi, Xj). 

2: Compute the Z = (Zij) : Zij = Wrj / {DiDjY/^ , where Di = J2j Wij. 

3: Extract U = [ui, . . . , u^], the top K eigenvectors of Z. 

4: Renormalize each row of U to have unit norm, obtaining a matrix V. 

5: Apply K-means to the row vectors of V in to find K clusters. 

6: Accordingly group the original points into K disjoint clusters. 



Spectral methods utilizing multiway affinities were proposed to better exploit 
additional structure present in the data. The spectral curvature clustering (SCC) 
algorithm of Chen and Lerman [12, 11] was designed for the case of hybrid linear 
modeling where the manifolds are assumed to be afRne, a setting that arises in 
motion segmentation [35] . Assuming that the subspaces are all of dimension d — 
a parameter of the algorithm, SCC starts by computing the (polar) curvature 
of all {d + 2)-tuples, creating an A^'^^'^+^^-tensor. The tensor is then flattened 
into a matrix A whose product with its transpose, W = AA', is used as an 
affinity matrix for the spectral algorithm SC. (In practice, the algorithm is 
randomized for computational tractability.) Kernel spectral curvature clustering 
(KSCC) [10] is a kernel version of SCC designed for the case of algebraic surfaces. 

The SCC algorithm (and therefore KSCC) is not localized in space as it fits 
a parametric model that is global in nature. The method we study here may be 
seen as a localization of SCC, which is appropriate in our nonparametric setting 
since the manifolds resemble affine surfaces locally. This type of approach is 
mentioned in publications on afhnity tensors [1, 48, 2, 27] and is studied here 
for the first time, to our knowledge. As discussed in Section 4, all reasonable 
variants have similar theoretical properties, so that we choose one of the simplest 
versions to ease the exposition. Concretely, we consider a multiway affinity that 
combines pairwise distances between nearest neighbors and the residual from the 
best d-dimensional local linear approximation. Formally, given a set of m > d+2 
points, Xi, . . . ,x,„, define 

A(i(xi, . . . , Xm) = min max dist(Xj,L), (2) 

LeAd j=l,...,m 

where dist(x, S) :— infggs ||x — s|| for a subset S C and Ad denotes the set 
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of d-dimensional affine subspaces in M^. In other words, Aj;(xi, . . . ,Xm) is the 
width of the thinnest tube (or band) around a d-dimensional affine subspace that 
contains Xi, . . . , x^- (In our implementation, we use the mean-square error; see 
Section 3.) Given scale parameters e > 77 > and a kernel function 0, define the 
following affinity: ad{:x.i, • . • , x„i) = if xi, . . . , x„i arc not distinct; otherwise: 

Ad(xi, . . . ,x„J\ 



where diam(xi, . . . ,x,„) is the diameter of {xi, . . . ,Xm}. See Figure I for an 
illustration. 



Q;d(xi, . . . ,x™) 



diam(xi, . . . , x„ 



Fig 1: The circle is of radius e/2 and the band is of half-width -q. Assuming we use 
the simple kernel, the m-tuple on the left has affinity Ud equal to one, while the other 
two m-tuples have affinity equal to zero, the first one for having a diameter exceeding 
e and the second one for being 'thicker' than rj. 



Given data points Xi , . . . , Xjv and approximation dimension d, we compute 
all m-way affinities, and then obtain pairwise similarities by clique expansion [2] 
(note that several other options are possible [12, 48, 27]): 

= X! Q:<i(xj,Xj,Xi,, . . . ,x,,„_2). (4) 

il,...,im-2 

Though it is tempting to choose m equal to d + 2, a larger m allows for more 
tolerance to weak separation and small sampling rate. The down side is what 
appears to be an impractical computational burden, since the mere computation 
of W in (4) requires order 0{N"^) flops. In Section 1.4, we discuss how to 
reduce the computational complexity to 0{N^'^°''-^^) flops, essentially without 
compromising performance. 

Once the affinity matrix W is computed, the SC algorithm is applied. We call 
the resulting procedure higher-order spectral clustering (HOSC), summarized in 
Algorithm 2. Note that HOSC is (essentially) equivalent to SC when 77 > e, and 
equivalent to SCC when e = 00. 
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Algorithm 2 Higher Order Spectral Clustering (HOSC) 

Input: 

xi, X2, Xjv: the data points 

d, m: the approximation dimension and affinity order 

e, 77: the aiBnity scales 

K: the number of clusters 
Output: 

A partition of the data into K disjoint clusters 
Steps: 

1: Compute the affinity matrix W = (Wij) according to (4). 
2: Apply SC (Algorithm 1). 



1.2. Generative Model 

It is time to introduce our framework. We assume a generative model where the 
clusters are the result of sampling points near surfaces embedded in an ambient 
Euclidean space, specifically, the _D-dimensional unit hypercube (0,1)^. For a 
surface S C (0, 1)^ and r > 0, define its r-ncighborhood as 

B{S, T)^{xe (0, 1)^ : dist(x, S) < r}. 

The reach of S is the supremum over r > such that, for each x e B{S,t), 
there is a unique point realizing inf{||x — s|| : s e S} [20]. It is well-known 
that, for submanifolds, the reach bounds the radius of curvature from be- 
low [20, Lem. 4.17]. For a connection to computational geometry, the reach 
coincides with the condition number introduced in [43] for submanifolds with- 
out boundary. Let vo\(i(S) denote the d-dimensional Hausdorff measure, and dS 
the boundary of S within (0, 1)^. For an integer 1 < d < D ~ 1 and a constant 
K > 1, let S^{k) be the class of d-dimensional, connected, submanifolds 
S C (0, 1)^ of 1/k < diam(5) < k and reach(S') > 1/k, and if S has a bound- 
ary, dS is a (d — l)-dimensional submanifold with reach(9S') > Given 
surfaces Si, ... ,3^ G and r < 1/k, we generate clusters Xi,. . . , X^^ by 

sampling Nk points uniformly at random in B{Sk,T), the r-neighborhood of Sk 
in (0, 1)^, for all k = 1, . . . ,K. We call r the jitter level. Except for Section 2.3, 
where we allow for intersections, we assume that the surfaces are separated by 
a distance of at least S > 0, i.e. 

dist(5fc,S'£) := inf inf [[x - y|l > 5, Vk ^ £. (5) 

In that case, by the triangle inequality, the actual clusters are separated by at 
least S — 2t, i.e. 

dist{Xk,Xi) >6-2t. 

We assume that the clusters are comparable in size by requiring that Nk < C-^^i 
for all k ^ for some finite constant C- Let xi, . . . , x„ denote the data points 
thus generated. See Figure 2 for an illustration. 

Given data X := {xi, . . . , Xjv}, we aim at recovering the clusters Xi, . . . , X^. 
Formally, a clustering algorithm is a function taking data X, and possibly other 
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Fig 2: This figure illustrates tlie generative model. Left: Three surfaces (here curves) 
with their r-neighborhood. The curves are separated by at least 5. Right: Points sam- 
pled within the tubular neighborhoods of the surfaces. 



tuning parameters, and outputs a partition of X. We say that it is 'perfectly 
accurate' if the output partition coincides with the original partition of X into 
Xi, . . . , Xfc ■ Our main focus is on relating the sample size N and the separation 
requirement in (5) (in order for HOSC to cluster correctly), and in particular 
we let T and S vary with A''. This dependency is left implicit. In contrast, we 
assume that d, K, ( are fixed. Also, we assume that d, r, K are known throughout 
the paper (except for Section 2.1 where we consider their estimation). Though 
our setting is already quite general, we discuss some important extensions in 
Section 4. 

We will also consider the situation where outliers may be present in the data. 
By outliers we mean points that were not sampled near any of the underlying 
surfaces. We consider a simple model where outliers are points sampled uni- 
formly in (0,1)^ \ [Ji; B{Sk,So) for some So > 0, in general different from S. 
That is, outliers are at least a distance Sq away from the surfaces. We let Nq 
denote the number of outliers, while N still denotes the total number of data 
points, including outliers. See Figure 3 for an illustration. 

1.3. Performance in terms of Separation and Robustness 
1.3.1. Performance of SC 

A number of papers analyze SC under generative models similar to ours [3, 54, 
44, 40], and the closely related method of extracting connected components of 
the neighborhood graph [3, 37, 9, 36]. The latter necessitates a compactly sup- 
ported kernel 4> and may be implemented via a union-of-balls estimator for the 
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Fig 3: This figure illustrates the generative model with outliers included in the data. 



support of the density [16]. Under the weaker (essentially Lipschitz) regularity 
assumption 

C-^e'' <vold{B{s,e)nS)<Ce'', Ve e (0, 1/C), Vs e 5, (6) 
Arias-Castro [3] shows that SC with a compactly supported kernel is accurate 

(a V 6 denotes the maximum of a and b and S> if On/^iv — > oo as — > cxi). 
With the heat kernel, the same result holds up to a \/\og N multiplicative factor. 
See also [37, 36], which prove a similar result for the method of extracting 
connected components under stronger regularity assumptions. At the very least, 
(7) is necessary for the union-of-balls approach and for SC with a compactly 
supported kernel, because sep„ is the order of magnitude of the largest distance 
between a point and its closest neighbor from the same cluster [45]. Note that 
(6) is very natural in the context of clustering as it prevents S from being 
too narrow in some places and possibly confused with two or more disconnected 
surfaces. And, when C in (6) is large enough and k is small enough, it is satisfied 
by any surface S belonging to Indeed, such a surface resembles an affine 

subspace locally and (6) is obviously satisfied for an affine surface. 

When outliers may be present in the data, as a preprocessing step, we identify 
as outliers data points with low connectivity in the graph with affinity matrix 
W, and remove these points from the data before proceeding with clustering. 
(This is done between Steps 1 and 2 in Algorithm 1.) In the context of spectral 
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clustering, this is very natural; see, e.g., [12, 37, 3]. Using the pairwise affinity 
(1), outliers are properly identified if Sq — t satisfies the lower bound in (7) and 
if the samphng is dense enough, specifically [3], 

Nk> {N'^^^ W NT°-'^)log{N), \/k = l,...,K. (8) 

When the surfaces are only required to be of Lipschitz regularity as in (6), we 
are not aware of any method that can even detect the presence of clusters among 
outliers if the sampling is substantially sparser. 



1.3.2. Performance of HOSC 



Methods using higher-order affinities are obviously more complex than methods 
based solely on pairwise affinities. Indeed, HOSC depends on more parame- 
ters and is computationally more demanding than SC. One, therefore, wonders 
whether this higher level of complexity is justified. We show that HOSC does 
improve on SC in terms of clustering performance, both in terms of required 
separation between clusters and in terms of robustness to outliers. 

Our main contribution in this paper is to establish a separation requirement 
for HOSC which is substantially weaker than (7) when the jitter r is small 
enough. Specifically, HOSC operates under the separation 



(5 - 2t > (t A sep„) V sep' 



2 

N ' 



(9) 



where a A 6 denotes the minimum of a and 6, and sep„ is the separation re- 
quired for SC with a compactly supported kernel, defined in (7). This is proved 
in Theorem 1 of Section 2. In particular, in the jitterless case (i.e. r = 0), the 
magnitude of the separation required for HOSC is (roughly) the square of that 
for SC at the same sample size; equivalently, at a given separation, HOSC re- 
quires (roughly) the square root of the sample size needed by SC to correctly 
identify the clusters. 




o o o oo^ 



S9 



Fig 4: Left: data. Middle; output from SC. Right: output from HOSC. The sampling 
is much sparser than in the original paper of Ng et al. [42], which is why SC fails. This 
figure is part of Figure 13 in Section 3, which displays more numerical experiments. 



That HOSC requires less separation than SC is also observed numerically. In 
Figure 4 we compare the outputs of SC and HOSC on the emblematic example 
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of concentric circles given in [42] (here with three circles). While the former fails 
completely, the latter is perfectly accurate. Indeed, SC requires that the majority 
of points in an e-ball around a given data point come from the cluster containing 
that point. In contrast, HOSC is able to properly operate in situations where the 
separation between clusters is so small, or the sampling rate is so low, that any 
such neighborhood is empty of data points except for the one point at the center. 
To further illustrate this point, consider the simplest possible setting consisting 
of two parallel line segments in dimension D — 2, separated by a distance 5 > 0, 
specifically. Si := {{t, 0) : t G [0, 1]} and 5*2 := {(i, 5) : t G [0, 1]}. Suppose N/2 
points are sampled uniformly on each of these line segments. It is well-known 
that the typical distance between a point on Sk and its nearest neighbor on Sk 
is of order 0(l/iV); see [45]. Hence, a method computing local statistics requires 
neighborhoods of radius at least of order 1/N, for otherwise some neighborhoods 
are empty. From (9), HOSC is perfectly accurate when S = (logN)^ /N"^ , say. 
When the separation 6 is that small, typical ball of radius of order 1/N around 
a data point contains about as many points from Si as from S2 (thus SC cannot 
work). See Figure 5 for an illustration. 




Fig 5: Clustering results obtained by SC (left) and HOSC (right) on a data set of two 
lines with small separation {S = 0.005). 100 points are sampled from each line, equally 
spaced (at a distance 0.01). Note that the inter-point separation on the same cluster 
is twice as large as the separation between clusters. In this case, SC cannot separate 
the two lines correctly, as we have argued. In contrast, HOSC performs perfectly when 
clustering the data, which again agrees with the theory and our expectation. We have 
also tried increasing the separation S from 0.005 to 0.025, in which case both SC and 
HOSC perform correctly. 

As a bonus, we also show that HOSC is able to resolve intersections in some 
(very) special cases, while SC is incapable of that. See Proposition 6 and also 
Figure 12. 

To make HOSC robust to outliers, we do exactly as described above, iden- 
tifying outliers as data points with low connectivity in the graph with affinity 
matrix W, this time computed using the multiway affinity (3). The separation 
and sampling requirements are substantially weaker than (8), specifically, 60 — t 
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is required to satisfy the lower bound in (9) and the sampHng 

{N'^''^'^^-'^^ y NT^-'^)\og{N), Vfc X. (10) 

This is estabhshed in Proposition 5, and again, we are not aware of any method 
for detection that is rehable when the sampUng is substantially sparser. For 
example, when t — Q and we are clustering curves {d = 1) in the plane {D — 2) 
(with background outliers), the sampling requirement in (8) is roughly 3> 
N'^l'^\og{N), compared to N^. > iV^/^ log(iV) in (10). In Figure 6 below we 
compare both SC and HOSC on outliers detection, using the data in Figure 4 
but further corrupted with 33.3% outliers. 




Fig 6: Left: data with outliers. Middle: outliers (black dots) detected by SC. Right: 
outliers (black dots) detected by HOSC. This figure is part of Figure 15 in Section 3, 
where more outliers-removal experiments are conducted. 



1.3.3. Other Methods 

We focus on comparing HOSC and SC to make a strong point that higher-order 
methods may be preferred to simple pairwise methods when the underlying 
clusters are smooth and the jitter level is small. In fact, we believe that no 
method suggested in the literature is able to compete with HOSC in terms of 
separation requirements. We quickly argue why. 

The algorithm of Kushnir et al. [32] is multiscale in nature and is rather 
complex, incorporating local information (density, dimension and principal di- 
rections) within a soft spectral clustering approach. In the context of semi- 
supervised learning, Goldberg et al. [25] introduce a spectral clustering method 
based on a local principal components analysis (PCA) to utilize the unlabeled 
points. Both methods rely on local PCA to estimate the local geometry of the 
data and they both operate by coarsening the data, eventually applying spec- 
tral clustering to a small subset of points acting as representative hubs for other 
points in their neighborhoods. They both implicitly require that, for the most 
part, the vast majority of data points in each neighborhood where the statistics 
are computed come from a single cluster. Souvenir and Pless [49] suggest an al- 
gorithm that starts with ISOMAP and then alternates in EM-fashion between 
the cluster assignment and the computation of the distances between points and 
clusters — this is done in a lower dimensional Euclidean space using an MDS em- 
bedding. Though this iterative method appears very challenging to be analyzed, 
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it relies on pairwise distances computed as a preprocessing step to derive the 
geodesic distances, which imphcitly assumes that the points in small enough 
neighborhoods are from the same manifold. Thus, like the SC algorithm, all 
these methods effectively rely on neighborhoods where only one cluster dom- 
inates. This is strong evidence that their separation requirements are at best 
similar to that of SC. The methods of Haro et al. [30] and Gionis et al. [24] 
are solely based on the local dimension and density, and are powerless when 
the underlying manifolds are of same dimension and sampled more or less uni- 
formly, which is the focus of this paper. The method of Guo et al. [29] relies on 
minimizing an energy that, just as HOSC, incorporates the diameter and local 
curvature of m-tuples, with m = 3 for curves and m = 4 for surfaces in 3D, 
and the minimization is combinatorial over the cluster assignment. In principle, 
this method could be analyzed with the arguments we deploy here. That said, 
it seems computationally intractable. 

1.4- Computational Considerations 

Thus it appears that HOSC is superior to SC and other methods in terms of 
separation between clusters and robustness to outliers, when the clusters are 
smooth and the jitter is small. But is HOSC even computationally tractable? 

Assume K and D are fixed. The algorithm starts with building the neigh- 
borhood graph (i.e., computing the matrix W). This may be done by brute 
force in 0{mN"^) flops. Clearly, this first step is prohibitive, in particular since 
we recommend using a (moderately) large m. However, we may restrict com- 
putations to points within distance e, which essentially corresponds to using a 
compactly supported kernel 4>. Hence, we could apply a range search algorithm 
to reduce computations. Alternatively, at each point we may restrict computa- 
tions to its i = Wjv log(A'^) nearest neighbors, with — >■ oo, or in a slightly 
different fashion, adapt the local scaling method proposed in [56] by replac- 
ing e in Q;(i(xij, . . . ,Xi^) by (e^^ • • • ei,^)^/"', where ti denotes the distance be- 
tween Xi and its £th nearest neighbor. The reason is that the central condition 
(12) effectively requires that the degree at each point be of order log(A^)™~^ 
(roughly), which is guaranteed if the nearest neighbors are included in the 
computations; see [3, 36] for rigorous arguments leading to that conclusion. 
In low dimensions, D = 0(loglog7V), a range search and ^-nearest-neighbor 
search may be computed effectively with kd-trees in 0(A^poly(logiV)) flops. In 
higher dimensions, it is essential to use methods that adapt to the intrinsic 
dimensionality of the data. Assuming that d is small, the method suggested 
in [8] has a similar computational complexity. Hence, the (approximate) affinity 
matrix W can be computed in order 0(Ai"poly(log7V)) -I- 0{N ■ £™); assuming 
m < log(A^)/(cjjv loglog(7V)), this is of order ©(iV^+i/""). This is within the 
possible choices for m in Theorem 1. 

Assume we use the ^-nearest-neighbor approximation to the neighborhood 
graph, with i = u]i^\og{N). Then computing Z may be done in 0{N'^'^'^^^'^) 
flops, since the affinity matrix W has at most ^™ = 0(iV^/"") non-zero coeffi- 
cients per row. Then extracting the leading K eigenvectors of Z may be done in 
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0(iS:iVl+l/"") flops, using Lanczos-type algorithms [15]. Thus we may run the 
^-nearest neighbor version of HOSC in 0(iVi+i/'^") flops, and it may be shown 
to perform comparably. 

We actually implemented the ^-nearest-neighbor variant of HOSC and tried it 
on a number of simulated datasets and a real dataset from motion segmentation. 
The results are presented in Section 3. The code is publicly available online [13]. 

1.5. Content 

The rest of the paper is organized as follows. The main theoretical results are 
in Section 2 where we provide theoretical guarantees for HOSC, including in 
contexts where outliers are present or the underlying clusters intersect. We em- 
phasize that HOSC is only able to separate intersecting clusters under very 
stringent assumptions. In the same section we also address the issue of estimat- 
ing the parameters that need to be provided to HOSC. In theory at least, they 
may be chosen automatically. In Section 3 we implemented our own version of 
HOSC and report on some numerical experiments involving both simulated and 
real data. Section 4 discusses a number of important extensions, such as when 
the surfaces self-intersect or have boundaries, which are excluded from the main 
discussion for simplicity. We also discuss the case of manifolds of different in- 
trinsic dimensions, suggesting an approach that runs HOSC multiple times with 
different d. And we describe a kernel version of HOSC that could take advantage 
of higher degrees of smoothness. Other extensions are also mentioned, including 
the use of different kernels. The proofs are postponed to the Appendix. 

2. Theoretical Guarantees 

Our main result provides conditions under which HOSC is perfectly accurate 
with probability tending to one in the framework introduced in Section 1.2. 
Throughout the paper, we state and prove our results when the surfaces have 
no boundary and for the simple kernel 4>{s) = l[\s\<i}j for convenience and ease 
of exposition. We discuss the case of surfaces with boundaries in Section 4.2 and 
the use of other kernels in Section 4.5. 

Theorem 1. Consider the generative model of Section 1.2. For — >■ oo slowly 
(e.g., = loglogA^j, assume the parameters of HOSC satisfy 



log N > m > 



log TV 



(11) 



Vlog Pn ' 



2 logA^ 



) 



1/d 




logiV 
N 



) 



l/D 



N 



(12) 



and 



(13) 
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Assume that (5) holds with 



S — 2t > e A PnV- 



(14) 



Under these conditions, when N is large enough, HOSC is perfectly accurate 
with probability at least 1 — N~P" . 

To relate this to the separation requirement stated in the Introduction, the 
condition (9) is obtained from (14) by choosing e and rj equal to their respective 
lower bounds in (12) and (13). 

We further comment on the theorem. First, the result holds if p„ = p and p 
is sufficiently large. We state and prove the result when oo as a matter 

of convenience. Also, by (11) and (14), the weakest separation requirement is 
achieved when m is at least of order slightly less than O(logiV) so that Pff 
is of order 0(1). However, as discussed in Section 1.4, the algorithm is not 
computationally tractable unless m — o{\ogN). This is another reason why we 
focus on the case where — >■ oo. Regarding the constraints (12)-(13) on e and 
77, they are there to guarantee that, with probability tending to one, each cluster 
is 'strongly' connected in the neighborhood graph. Note that the bound on e is 
essentially the same as that required by the pairwise spectral method SC [3, 36]. 
In turn, once each cluster is 'strongly' connected in the graph, clusters are 
assumed to be separated enough that they are 'weakly' connected in the graph. 
The lower bound (14) quantifies the required separation for that to happen. 
Note that it is specific to the simple kernel. For example, the heat kernel would 
require a multiplicative factor proportional to \/\og N. 

So how does HOSC compare with SC? When the jitter is large enough that 
T ^ (log(iV)/Af)^/'*, we have 77 > e and the local linear approximation contribu- 
tion to (3) does not come into play. In that case, the two algorithms will output 
the same clustering (see Figure 7 for an example). 
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Fig 7: Clustering results obtained by SC (left) and HOSC (right) on the data set of 
Figure 5, but with separation 5 = 0.025 and jitter r = 0.01. In this example, neither 
SC nor HOSC can successfully separate the two lines. This example supports our claim 
that when the jitter is large enough (relative to separation), HOSC does not improve 
over SC and the two algorithms will output the same clustering. 
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When the jitter is small enough that T < (log(Ar)/iV)i/rf, HOSC requires less 
separation, as demonstrated in Figure 5. Intuitively, in this regime the clusters 
are sampled densely enough relative to the thickness r that the smoothness of 
the underlying surfaces comes into focus and each cluster, as a point cloud, 
becomes locally well-approximated by a thin band. We provide some numerical 
experiments in Section 3 showing HOSC outperforming SC in various settings. 

Thus, HOSC improves on SC only when the jitter is small. This condition 
is quite severe, though again, we do not know of any other method that can 
accurately cluster under the weak separation requirement displayed here, even 
in the jitterless case. It is possible that some form of scan statistic (i.e., matched 
filters) may be able to operate under the same separation requirement without 
needing the jitter to be small, however, we do not know how to compute it in 
our nonparametric setting — even in the case of hybrid linear modeling where the 
surfaces are affine, computing the scan statistic appears to be computationally 
intractable. At any rate, the separation required by HOSC is essentially optimal 
when T is of order 0{N~^^'^) or smaller. A quick argument for the case d = 1 
and D = 2 goes as follows. Consider a line segment of length one and sample N 
points uniformly at random in its r- neighborhood, with r = 0{1/N). The claim 
is that this neighborhood contains an empty band of thickness of order slightly 
less than 0{1/N^), and therefore cannot be distinguished from two parallel line 
segments. Indeed, such band of half-width A inside that neighborhood is empty 
of sample points with probability (1 — A/r)^, which converges to 1 if NX/t — >■ 0, 
and when r = 0{1/N), this is the case if A = o(l/7V2). 

In regards to the choice of parameters, the recommended choices depend 
solely on (d, t, K) . These model characteristics are sometimes unavailable and 
we discuss their estimation in Section 2.1. Afterwards, we discuss issues such as 
outliers (Section 2.2) and intersection (Section 2.3). 

2.1. Parameter Estimation 

In this section, we propose some methods to estimate the intrinsic dimension d 
of the data, the jitter r and the number of clusters K. Though we show that 
these methods are consistent in our setting, further numerical experiments are 
needed to determine their potential in practice. 

Compared to SC, HOSC requires the specification of three additional pa- 
rameters. This is no small issue in practice. In theory, however, we recommend 
choosing d and K consistent with their true values, e and rj as functions of r, and 
m of order slightly less than log(A^). The true unknowns are therefore {d, t, K). 
We provide estimators for d and K that are consistent, and an estimator for 
T that is accurate enough for our purposes. Specifically, we estimate d and r 
using the correlation dimension [28] and an adaptation of our own design. The 
number of clusters K is estimated via the eigengap of the matrix Z. 
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2.1.1. The Intrinsic Dimension and the Jitter Level 

A number of methods have been proposed to estimate the intrinsic dimensional- 
ity; we refer the reader to [33] and references therein. The correlation dimension, 
first introduced in [28], is perhaps the most relevant in our context, since surfaces 
may be close together. Define the pairwise correlation function 

Cor(e) =^^l{||x,_x,||<e}• 
The authors of [28] recommend plotting logCor(e) versus loge and estimating 
the slope of the linear part. We use a slightly different estimator that allows us 
to estimate r too, if it is not too small. The idea is to regress log Cor(e) on log e 
and identify a kink in the curve. See Figure 8 for an illustration. 




3 

2 



1 I ' ' ' ' ' 1 

-6 -5 -4 -3 -2 -1 

log(E) 

Fig 8: A correlation curve for a simulated data set of 240 points sampled from the 
r-neighborhood of three disjoint one-dimensional curves (d = 1) in dimension ten 
{D = 10) crossing all dimensions. The jitter is r = 0.01. We see that the linear part 
of the curve has slope (near) 1, which coincides with the intrinsic dimension of the 
curves. The kink appears near f := exp(— 4.5) — 0.0111, a close approximation to r. 



Though several (mostly ad hoc) methods have been proposed for finding 
kinks, we describe a simple method for which we can prove consistency. Fix 
Pjv oo, with ^ logiV. Define 

"loglog(Ar)-logiVl 
d log 

Let Ar = logCor(p^'"). If there is r e {3, . . . , r„ — 2D — 1} such that 

{Ar~ Ar+l)/\0gp^ > D-1/2, 

then let f > be the smallest such r; otherwise, let f = Tn — ^D. Define f = p~^; 
and also d = £), if f = 3, and d the closest integer to (A3 — Af.)/{f log p^^), 
otherwise. 
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Proposition 1. Consider the generative model described in Section 1.2 with 
Si, ... , Sk G Assume that r < p^^ and, if there are Nq outliers, assume 

that N — Nq > N/p^. Then the following holds with probability at least 1 — 
j\r-VP«; if f < r,^ — 2D, then r S [t/ p^, pt^r]; if r — r^ — 2D, then t < f; 
moreover, if f > i, d — d. 

In the context of Proposition 1, tlie only time tliat d is inconsistent is when 
r is of order p~^ or larger, in which case d = D; this makes sense, since the 
region [Jf^ B(Sk,T) is in fact I?-dimensional if t is of order 1. Also, f is within 
a Pk factor of t if r is not much smaller than (log(A^)/A^)-'^/'^. 

We now extend this method to deal with a smaller t. Consider what we just 
did. The quantity Cor(e) is the total degree of the e-neighborhood graph built 
in SC. Fixing (d, m), we now consider the total degree of the 77-neighborhood 
graph built in HOSC. Define the multiway correlation function 

Cou.^{e,rj)=Y,D]/^^^-'\ 

i 

Similarly, we shall regress log Corf;^„i(e, ??) on log 77 and identify a kink in the 
curve (Figure 9 displays such a curve). 




-12 -10 -8 -6 -4 -2 -12 -11 -10 -9 -8 -7 -6 -5 -4 

log(E) log(T|) 

Fig 9: Correlation curves corresponding to SC (left) and HOSC (right) for the data set 
of Figure 8, but with a much smaller r = le — 4. We see that the pairwise correlation 
function works poorly in this case, while the multiway correlation curve has a kink 
near f := exp(— 10.5) = 2.754e — 5, within a factor of | of the true r. 



Using the multiway correlation function, we then propose an estimator f as 
follows. We assume that the method of Proposition 1 returned f = r„ — 2D, for 
otherwise we know that f is accurate. Choose d = d and m > log(A^)(logpjv)^. 
Note that this is the only time we require m to be larger than log N. Let = 
log Coi d,m{pN^ , Pn^^^)- If there is s € {0, . . . , f — 1} such that 

{B,- Bs+i)/ log p^> D-d- 1/2, 

then let s be the smallest one; otherwise, let s — f. We then redefine f as 

f = p-'-'+\ 



imsart-ejs ver. 2011/11/15 file: higher-order-ejs-112811.tex date: November 30, 2011 



Arias-Castro, Chen and Lerman/Higher Order Spectral Clustering 



17 



Proposition 2. In the context of Proposition 1, assume that f ^ — 2D. 
Then redefining f as done above, the following holds with probability at least 
1 — N^^^ : if s < f, then r £ [t j Pnt\; if s = r, then t < t. 

Now, f comes close to r if r is not mucli smaller than {log{N)/N)^/'^. Whether 
this is the case, or not, the statement of Theorem 1 applies with f in place of r 
in (13). 

Though our method works in theory, it is definitely asymptotic. In practice, 
we recommend using other approaches for determining the location of the kink 
and the slope of the linear part of the pairwise correlation function (in log- 
log scale). Robust regression methods with high break-down points, like least 
median of squares and least trimmed squares, worked well in several examples. 
We do not provide details here, as this is fairly standard, but the figures are 
quite evocative. 

2.1.2. The Number of Clusters 

HOSC depends on choosing the number of clusters K appropriately. A common 
approach consists in choosing K by inspecting the eigenvalues of Z. We show 
that, properly tuned, this method is consistent within our model. 

Proposition 3. Compute the matrix Z in HOSC with the same choice of pa- 
rameters as in Theorem 1, except that knowledge of K is not needed. Set the 
number of clusters equal to the number of eigenvalues ofZ( counting multiplic- 
ity) exceeding 1 — A^~^/p„. Then with probability at least 1 — N~P'^ , this method 
chooses the correct number of clusters. 

We implicitly assumed that d and t are known, or have been estimated as 
described in the previous section. The proof of Proposition 3 is parallel to that 
of [3, Prop. 4], this time using the estimate provided in part (Al) of the proof 
of Theorem 1. Details are omitted. 

Figure 10 illustrates a situation where the number of clusters is correctly cho- 
sen by inspection of the eigenvalues, more specifically, by counting the number 
of eigenvalue 1 in the spectrum of Z (up to numerical error) . This success is due 
to the fact that the clusters are well-separated, and even then, the eigengap is 
quite small. 

We apply this strategy to more data later in Section 3, and show that it can 
correctly identify the parameter K in some cases (see Figure 14). In general 
we do not expect this method to work well when the data has large noise or 
intersecting clusters, though we do not know of any other method that works 
in theory under our very weak separation requirements. 

2.2. When Outliers are Present 

So far we have only considered the case where the data is devoid of outliers. 
We now assume that some outliers may be included in the data as described at 

imsart-ejs ver. 2011/11/15 file: higher-order-ejs-112811.tex date: November 30, 2011 



Arias-Castro, Chen and Lerman/ Higher Order Spectral Clustering 



18 



1.001 




Fig 10: The top six eigenvalues of the weight matrix Z obtained by HOSC in Step 2 for 
the same data used in Figure 8. Though in this example the clusters are well-separated, 
the eigengap is still very small (about 0.005). 



the end of Section 1.2. As stated there, we label as outlier any data point with 
low degree in the neighborhood graph, as suggested in [12, 37, 3]. Specifically, 
we compute D as in Step 2 of HOSC, and then label as outliers points with 
degree below some threshold. Let p„ -> oo slower than any power of N , e.g., 
Pm — log N. We propose two thresholds: 

(01) Identify as outliers points with degree: 

r,l/(m-l) / -1 j^l/(m-l) 

j ■' 

(02) Identify as outliers points with degree: 

Taking up the task of identifying outliers, only the separation between outliers 
and non-outliers is relevant, so that we do not require any separation between 
the actual clusters. We first analyze the performance of (01), which requires 
about the same separation between outliers and non-outliers as HOSC requires 
between points from different clusters in (14). 

Proposition 4. Consider the generative model described in Section 1.2. Assume 
that N — Nq > N/pf, and that (11)-(13) hold. In terms of separation, assume 
that Sq — T > e A p^T]. Then with probability at least 1 — N^P" , the procedure 
(01) identifies outliers without error. 

We now analyze the performance of (02), which requires a stronger separa- 
tion between outliers and non-outliers, but operates under very weak sampling 
requirements. 

Proposition 5. Assume that m is as in (11), and 

e = (p„ log(iV)/7V)i/(2^-'i), = (p„ log(iV)/7V)2/(2^-'^). (15) 
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In terms of separation, assume that Sq ^ t > e. In addition, suppose that 

Nk > Pn log(iV)iV''/(2i3-d) y j^j^D-d^ Vfc = 1, . . . , X. (16) 

Then with probability at least 1 — N^^" , the procedure (02) identifies outliers 
without error. 

If (5o — T, so that outliers are sampled everywhere but within the r-tubular re- 
gions of the underlying surfaces, then both (01) and (02) may miss some outliers 
within a short distance from some B{Sk,T). Specifically, (01) (resp. (02)) may 
miss outliers within e A p^r] (resp. within e) from some B(Sk, t). Using Weyl's 
tube formula [55], we see that there are order Nq{€ A Pn1i)^~'^ (resp. Nqc^^'^) 
such outliers, a small fraction of all outliers. 

The sampling requirement (16) is weaker than the corresponding requirement 
for pairwise methods displayed in (8). In fact, (16) is only slightly stronger than 
what is required to just detect the presence of a cluster hidden in noise. We 
briefly explain this point. Instead of clustering, consider the task of detecting 
the presence of a cluster hidden among a large number of outliers. Formally, 
we observe the data Xi , . . . , Xjv , and want to decide between the following two 
hypotheses: under the null, the points are independent, uniformly distributed in 
the unit hypercube (0, 1)^; under the alternative, there is a surface Si G S'^{k) 
such that Ni points are sampled from B{Si, r) as described in Section 1.2, while 
the rest of the points, N — Ni of them, are sampled from the unit hypercube 
(0, 1)^, again uniformly. Assuming that the parameters d and r are known, it 
is shown in [5, 4] that the scan statistic is able to separate the null from the 
alternative if 

We are not aware of a method that is able to solve this detection task at a 
substantially lower sampling rate, and (16) comes within a logarithmic factor 
from (17). We thus obtain the remarkable result that accurate clustering is 
possible within a log factor of the best (known) sampling rate that allows for 
accurate detection in the same setting. 

2.3. When Clusters Intersect 

We now consider the setting where the underlying surfaces may intersect. The 
additional conditions we introduce are implicit constraints on the dimension of, 
and the incidence angle at, the intersections. We suppose there is an integer 
< dint < d — 1 and a finite constant C > such that 

YoUB{Sk n Se, e) n Sk) < Ce''-''"^\ Ve e (0, 1/k), Vfc ^ £. (18) 

(The subscript int stands for 'intersection'.) In addition, we assume that for 
some 6'int G (0,7r/2], 

dist(x, Si)>S A sin(6'int) dist(x, Sk n 5"^), Vx e Sk, 'ik ^ I with Sk C^St^^. 

(19) 

imsart-ejs ver. 2011/11/15 file: higher-order-ejs-112811.tex date: November 30, 2011 



Arias-Castro, Chen and Lerman/Higher Order Spectral Clustering 



20 



(18) is slightly stronger than requiring that Sk H Si has finite dint-dimensional 
volume. If the surfaces are affine, it is equivalent to the condition dim(5'fe nS'^) < 
dint, Vfc £. (19), on the other hand, is a statement about the minimum an- 
gle at which any two surfaces intersect. For example, if the surfaces are affine 
within distance S of their intersection, then (19) is equivalent to their maxi- 
mum (principal) angle being bounded from below by ^int- See Figure 11 for an 
illustration. 



Fig 11: Illustration of intersecting surfaces. Though the human eye easily distinguishes 
the two clusters, the clustering task is a lot harder for machine learning algorithms. 
The main issue is that there are too many data points at the intersection of the two 
tubular regions. However, in very special cases HOSC is able to separate intersecting 
clusters (see Figure 12 for such an example). 



Proposition 6. Consider the setting of Theorem 1, with (5) replaced by (19). 
In addition, assume that (18) holds. Define 

7„ := 7V2e'*(e Ap„77)'*-'*'-(sin0int)''»'-''. 

Then there is a constant C > such that, with probability at least 1 — C ^m, 
HOSC is perfectly accurate. 

The most favorable case is when t = and ^i^t = 7r/2. Then with our choice 
of e and 77 in Theorem 1, assuming increases slowly, e.g., -< log TV, we 
have 7jv ^ if 2dint < d, and partial results suggest this cannot be improved 
substantially. This constraint on the intersection of two surfaces is rather severe. 
Indeed, a typical intersection between two (smooth) surfaces of same dimension 
d is of dimension d — 1, and if so, only curves satisfy this condition. Figure 12 
provides a numerical example showing the algorithm successfully separating 
two intersecting one-dimensional clusters. Thus, even with no jitter and the 
surfaces intersecting at right angle, HOSC is only able to separate intersecting 
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clusters under exceptional circumstances. Moreover, even when the conditions 
of Proposition 6 are fulfilled, the probability of success is no longer exponentially 
small, but is at best of order (l/A^)i-2di„t/d^ That said, SC does not seem able 
to properly deal with intersections at all (see also Figure 12). It essentially 
corresponds to taking 77 = e in HOSC, in which case never tends to zero. 




Fig 12: Left: data. Middle: output from HOSC. Right: Output from SC. This example 
shows that HOSC is able to separate intersecting curvihnear clusters when the inci- 
dence angle is perpendicular and there is no jitter (r = 0). In particular, the conditions 
of Proposition 6 are satisfied. On the contrary, SC fails in this case. 



Though the implications of Proposition 6 are rather limited, we do not know 
of any other clustering method which provably separates intersecting clusters 
under a similar generative model. This is a first small step towards finding such 
a method. 



3. Software and Numerical Experiments 

We include in this section a few experiments where a preliminary implemen- 
tation of HOSC outperforms SC, to demonstrate that higher-order affinities 
can bring a significant improvement over pairwise affinities in the context of 
manifold clustering. 

In our implementation of HOSC, we used the heat kernel 4>(s) = exp(— s^). 
Following the discussion in Section 1.4, at each point we restrict the compu- 
tations to its € nearest neighbors so that we practically remove the locality 
parameter e from the affinity function of (3) and obtain 



arf(xi, . . . ,x„ 



l0(Ad(xi,...,x,„)/7?), ifx2,...,x„ e £-NN(xi) distinct; 
I0, otherwise, 

(20) 

where £-NN(xi) is the set of the £ nearest neighbors of xi . For computational 
ease, we used 



A^^^(xi,...,x„) = min 



i X:dist(x„L)2, (21) 



which can be easily computed using the bottom m — d singular values of the 
m points. Note that, since Ad/\/m < < A^, the results we obtained apply. 
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with rj changed by a y'm factor, at most. (In the paper, the standard choice for 
77 is a power of N, while m is of order at most logA^, so this factor is indeed 
neghgible.) In practice, we always search a subinterval of [0, 1] for the best 
working rj (e.g., [.001, .1]), based on the smallest variance of the corresponding 
clusters in the eigenspace (the row space of the matrix V) , as suggested in [42] . 
When the given data contains outliers, the optimal choice of 77 is based on the 
largest gap between the means of the two sets of degrees (associated to the 
inliers and outliers), normalized by the maximum degree. The code is available 
online [13]. 

3.1. Synthetic Data 

We first generate five synthetic data sets in the unit cube (0, 1)^ (I? = 2 or 3), 
shown in Figure 13. In this experiment, the actual number of clusters (i.e. K) 
and dimension of the underlying manifolds (i.e. d) are assumed known to all 
algorithms. For HOSC, we fix £ = 10, m = d + 2, and use the subinterval 
[0.001, 0.1] as the search interval of 77. For SC, we considered two ways of tuning 
the scale parameter e: directly, by choosing a value in the interval [0.001,0.25] 
(SC-NJW); and by the local scahng method of [56] (SC-LS), with the number 
of nearest neighbors ^ = 5, . . . , 15. The final choices of these parameters were 
also based on the same criterion as used by HOSC. 

Figure 13 exhibits the clusters found by each algorithm when applied to 
the five data sets, respectively. Observe that HOSC succeeded in a number 
of difficult situations for SC, e.g., when the sampling is sparse, or when the 
separation is small at some locations. 

We also plot the leading eigenvalues of the matrix Z obtained by HOSC on 
each data set; see Figure 14. We see that in data sets 1, 2, 5, the number of 
eigenvalue 1 coincides with the true number of clusters, while in 3 and 4 there 
is some discrepancy between the Kth eigenvalue and the number 1. Though we 
do not expect the eigengap method to work well in general, Figure 14 shows 
that it can be useful in some cases. 

Figure 15 displays some experiments including outliers. We simply sampled 
points from the unit square (0, 1)^ uniformly at random and added them as out- 
liers to the first three data sets in Figure 13, with percentages 33.3%, 60% and 
60%, respectively. We applied SC and HOSC assuming knowledge of the propor- 
tion of outliers, and labeled points with smallest degrees as outliers. Choosing 
the threshold automatically remains a challenge; in particular, we did not test 
the theory. 

We observe that HOSC could successfully remove most of the true outliers, 
leaving out smooth structures in the data; in contrast, SC tended to keep isolated 
high-density regions, being insensitive to sparse smooth structures. A hundred 
replications of this experiment (i.e., fixing the clusters and adding randomly gen- 
erated outliers) show that the True Positive Rates (i.e., percentages of correctly 
identified outliers) for (SC, HOSC) are (58.1% vs 67.7%), (75.4% vs 86.8%) and 
(76.8% vs 88.0%), respectively. 
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Fig 13: Left column: data. (The third example shows a sphere containing an ellipsoid 
inside.) Middle column: best output from SC with the scale parameter chosen by 
both searching the interval [0.001,0.25] and applying local scaling [56] with at most 
15 nearest neighbors. Right column: output from HOSC. The optimal value of 77 is 
selected from the interval [0.001,0.1]. We also tried the simple kernel instead of the 
heat kernel, and obtained same results except in data set 3. 
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Fig 14: Top eigenvalues of the matrix Z obtained by HOSC on each of the five data 
sets in Figure 13 (in same order). 
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Fig 15: Outlier-removal experiments. Left column: data with outliers. The percentages 
of outliers are 33.3%, 60% and 60%, respectively. Middle: outliers (black dots) detected 
by pairwise spectral clustering (both SC-NJW and SC-LS, but only the better result 
is shown). Right: outliers (black dots) detected by HOSC. The use of the simple kernel 
(instead of the heat kernel) in HOSC gives very similar results. 
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3.2. Real Data 

We next compare SC and HOSC using the two-view motion data studied in 
[10, 46]. This data set contains 13 motion sequences: (1) boxes, (2) carsnbus3, (3) 
deliveryvan, (4) desk, (5) lightbulb, (6) manycars, (7) man-in- office, (8) nrbooks3, 
(9) office, (10) parking-lot, (11) posters-checkerboard, (12) posters-keyboard, and 
(13) toys- on-table; and each sequence consists of two image frames of a 3-D 
dynamic scene taken by a perspective camera (see Figure 16 for a few such 
sequences). Suppose that several feature points have been extracted from the 
moving objects in the two camera views of the scene. The task is to separate the 
trajectories of the feature points according to different motions. This apphca- 
tion, which Hes in the field of structure from motion, is one of the fundamental 
problems in computer vision. 




Fig 16: Three exemplary two- view motion sequences (arranged in columns): (4) desk, 
(6) manycars and (7) man-in-office. The true clusters are displayed in different colors 
and markers (the black dots are outliers). 

Given a physical point x G M"^ and its image correspondences in the two 
views {xi,yiy, (x2,y2)' G one can always form a joint image sample y = 
(si, j/i, a;2i 2/2j 1)' G M^. It is shown in [46] that, under perspective camera pro- 
jection, all the joint image samples y corresponding to different motions live on 
different manifolds in M^, some having dimension 2 and others having dimen- 
sion 4. Exploratory analysis applied to these data suggests that the manifolds in 
this dataset mostly have dimension 2 (see Figure 17). Therefore, we will apply 
our algorithm (HOSC) with d = 2 to these data sets in order to compare with 
pairwise spectral clustering (SC-NJW, SC-LS). 

We use the following parameter values for the two algorithms. In HOSC, we 
choose £ — 20, m — d + 2,rj £ [.0001, .1], while in SC we try both searching the 
interval [.001, .5] (SC-NJW) and local scaling with at most 24 nearest neighbors 
(SC-LS). 
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Fig 17: The true clusters of the three sequences in Figure 16 (in same order), shown 
in top three principal dimensions. (The outliers have been removed from the data and 
thus are not displayed). These plots clearly indicate that the underlying manifolds are 
two dimensional. 



The original data contains some outliers. In fact, 10 sequences out of the 
13 are corrupted with outliers, with the largest percentage being about 32%. 
We first manually remove the outliers from those sequences and solely focus on 
the clustering aspects of the two algorithms. Next, we add outliers back and 
compare them regarding outliers removal. (Note that we need to provide both 
algorithms with the true percentage of outliers in each sequence.) By doing so 
we hope to evaluate the clustering and outliers removal aspects of an algorithm 
separately and thus in the most accurate way. 

Table 1 

The misclassification rates and the numbers of true outliers detected by HOSC, 
SC-NJW and SC-LS. In the clustering experiment, the outliers-free data is used; then 
the outliers are added back so that each of these algorithms can be applied to detect 
them. For SC-NJW, the tuning parameter is selected from the interval [.001, .5]; for 
SC-LS, a maximum of 24 nearest neighbors are used; for HOSC, 20 nearest neighbors 
are used and the flatness parameter rj is selected from the interval [.0001, .1]. 



Data 


Clustering Errors 


# True Outliers Detected 


seq. 


#samplos 


#out. 


SC-NJW 


SC-LS 


HOSC 


SC-NJW 


SC-LS 


HOSC 


1 


115,121 


2 


0.85% 


0.85% 


0.85% 


1 


1 


1 


2 


85,45,89 


28 


0% 


0% 


0% 


24 


24 


24 


3 


62,192 





30.3% 


23.6% 


30.3% 


N/A 


N/A 


N/A 


4 


50,50,55 


45 


0.65% 


2.58% 


1.29% 


35 


30 


37 


5 


51,121,33 





0% 


0% 


0% 


N/A 


N/A 


N/A 


6 


54,24,23,43 





18.8% 


0% 


0% 


N/A 


N/A 


N/A 


7 


16,57 


34 


19.2% 


19.2% 


0% 


17 


12 


26 


8 


129,168,91 


32 


22.9% 


17.8% 


22.9% 


12 


17 


23 


9 


76,109,74 


48 


0% 


0% 


0% 


36 


28 


36 


10 


19,117 


4 


0% 


47.8% 


0% 








1 


11 


100,99,81 


99 


0% 


1.79% 


0% 


42 


39 


73 


12 


99,99,99 


99 


0.34% 


0.34% 


0% 


80 


43 


91 


13 


49,42 


35 


33.0% 


15.4% 


2.20% 


7 


6 


21 



Table 1 presents the results from the experiments above. Observe that HOSC 
achieved excellent clustering results in all but two sequences, with zero error on 
eight sequences, one mistake on sequence (13), and two mistakes on each of 
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(1) and (4). We remark that HOSC also outperformed the algorithms in [10, 
Table 1] , in terms of clustering accuracy, but due to the main aim of this paper, 
we do not include those results in Table 1. In contrast, each of SC-NJW and 
SC-LS failed on at least five sequences (with over 15% misclassification rates), 
both containing the two bad sequences for HOSC. As a specific example, we 
display in Figure 18 the clusters obtained by both HOSC and SC on sequence 
(7), demonstrating again that higher order affinities can significantly improve 
over pairwise affinities in the case of manifold data. Regarding outliers removal, 
HOSC is also consistently better than SC-NJW and SC-LS (if not equally good). 




Fig 18: Clustering results of both HOSC and SC (left to right) on sequence (7). (The 
truth is shown in Figure 17, rightmost plot). In this example, HOSC correctly found 
the two clusters, using geometric information; in contrast, SC failed because it solely 
relies on pairwise distances. 



4. Extensions 

4.I. When the Underlying Surfaces Self -Intersect 

In our generative model described in Section 1.2 we assume that the surfaces 
are submanifolds, implying that they do not self- intersect. This is really for 
convenience as there is essentially no additional difficulty arising from self- 
intersections. If we allow the surfaces to self- intersect, then we bound the maxi- 
mum curvature (from above) and not the reach. We could, for example, consider 
surfaces of the form S = /(Brf(0, 1)), where / : 5^(0, 1) ^ (0, 1)^ is locally bi- 
Lipschitz and has bounded second derivative. A similar model is considered 
in [38] in the context of set estimation. Clearly, proving that each cluster is 
connected in the neighborhood graph in this case is the same. The only issue 
is in situations where a surface comes within distance e from another surface 
at a location where the latter intersects itself. The geometry involved in such a 
situation is indeed complex. If we postulate that no such situation arises, then 
our results generalize immediately to this setting. 
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4-2. When the Underlying Surfaces Have Boundaries 

When the surfaces have boundaries, points near the boundary of a surface may 
be substantially connected with points on a nearby surface. See Figure 19 for 
an illustration. This is symptomatic of the fact that the algorithm is not able 
to resolve intersections in general, as discussed in Section 2.3, with the notable 
exception of clusters of dimension d = 1, as illustrated in the 'two moons' 
example of Figure 13. 




Fig 19: An example of a surface with a boundary coming close to another surface. 
This is a potentially problematic situation for HOSC as the points near the boundary 
of one surface and close to the other surface may be strongly connected to points from 
both clusters. Numerically, we show in Figure 13 such an example where HOSC is 
successful. 

If we require a stronger separation between the boundary of a surface and 
the other surfaces, specifically, 

dist{dSk,Si)>dt, yk^i, (22) 

with (5j — 2t > e, no point near the boundary of a cluster is close to a point 
from a different cluster. (A corresponding requirement in the context of outliers 
would be that outliers be separated from the boundary of a cluster by at least 
5o,t, with (5o,t — T > e.) 

4-3. When the Data is of Mixed Dimensions 

In a number of situations, the surfaces may be of different intrinsic dimensions. 
An important instance of that is the study of the distribution of galaxies in 
space, where the galaxies are seen to cluster along filaments {d =^ 1) and walls 
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(d — 2) [39]. We propose a top-down approach, implementing HOSC for each 
dimension d starting at £) — 1 and ending at 1 (or between any known upper 
and lower bounds for d). 

At each step, the algorithm is run on each cluster obtained from the pre- 
vious step, including the set of points identified as outliers. Indeed, when the 
dimension parameter of the algorithm is set larger than the dimension of the 
underlying surfaces, HOSC may not be able to properly separate clusters. For 
example, two parallel segments satisfying the separation requirement of Theo- 
rem 1 still belong to a same plane and HOSC with dimension parameter d = 2 
would not be able to separate the two line segments. Another reason for process- 
ing the outlier bin is the greater disparity in the degrees of the data points in 
the neighborhood graph often observed with clusters of different dimensions. At 
each step, the number of clusters is determined automatically according to the 
procedure described in Section 2.1, for such information is usually not available. 
The parameters e and 77 are chosen according to (15). Partial results suggest that, 
under some additional sampling conditions, this top-down procedure is accurate 
under weaker separation requirements than required by pairwise methods, which 
handle the case of mixed dimensions seamlessly [3]. The key is that an actual 
cluster Afe, as defined in Section 1.2, is never cut into pieces. Indeed, properties 
(Al) and (A4) in the proof of Theorem 1, which guarantee the connectivity and 
regularity (in terms of comparable degrees) of the subgraph represented by Xk, 
are easily seen to also be valid when the dimension parameter of the algorithm 
is set larger than d. (This observation might explain the success of the SCC 
algorithm of [12] in some mixed settings when using an upper bound on the 
intrinsic dimensions.) 

4.4- Clustering Based on Local Polynomial Approximations 

For 1 < d < D — \ and an integer r > 3, let S^{k) be the subclass of S]^{k) 
of d-dimensional submanifolds S such that, for every x € 5* with tangent Tx, 
the orthogonal projection S n i?(x, 1/k) — > Tx is a C-diffeomorphism with all 
partial derivatives of order up to r bounded in supnorm by k. For example, S^{k) 
includes a subclass of surfaces of the form S = f{Bj^{0, 1)), where / : i?d(0, 1) — >■ 
(0, 1)^ is locally bi-Lipschitz and has its first r derivatives bounded. (We could 
also consider surfaces of intermediate, i.e.. Holder smoothness, a popular model 
in function and set estimation [18, 38].) 

Given that surfaces in are well-approximated locally by polynomial sur- 
faces, it is natural to choose an affinity based on the residual of the best d- 
dimensional polynomial approximation of degree at most r — 1 to a set of points 
Xi , . . . , x,„ . This may be implemented via the "kernel trick" with a polynomial 
kernel, as done in [10] for the special case of algebraic surfaces. The main dif- 
ference with the case of surfaces that we consider in the rest of the paper 
is the degree of approximation to a surface S £ hy its osculating algebraic 
surface of order r — 1; within a ball of radius e, it is of order 0{e^). 

Partial results suggest that, under similar conditions, the kernel version of 
HOSC with r known may be able to operate under a separation of the form 
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(9), with the exponent 2/d replaced by r/d and, in the presence of outliers, 
within a logarithmic factor of the best known sampling rate ratio achieved by 
any detection method [5, 4]: 

min Nk > N'^/i^D-{r-i)d) y ^^D-d^ ^23) 

k 

Regarding the estimation of t, defining the correlation dimension using the 
underlying affinity defined here allows to estimate t accurately down to (essen- 
tially) {\og{N) / Ny / ^ if the surfaces are all in S^{k). The arguments are parallel 
and we omit the details. 

Thus, using the underlying affinity defined here may allow for higher ac- 
curacy, if the surfaces are smooth enough. However, this comes with a larger 
computational burden and at the expense of introducing a new parameter r, 
which would need to be estimated if unknown, and we do not know a good way 
to do that. 

4-5. Other Extensions 

The setting we considered in this paper, introduced in Section 1.2, was delib- 
erately more constrained than needed for clarity of exposition. We list a few 
generalizations below, all straightforward extensions of our work. 

• Sampling. Instead of the uniform distribution, we could use any other 
distribution with a density bounded away from and oo, or with fast 
decaying tails such as the normal distribution. 

• Kernel. The rate of decay of the kernel </) dictates the range of the affinity 
(3). Let Wjv be a non- decreasing sequence such that N^"^(j){u}N) — > 0. For 
a compactly supported kernel, Wjv = sup{s : 0(s) > 0}, while for the 
heat kernel, we can take uj^ = 2^m log N. As we will take m — oo, (/> is 
essentially supported in [0,w„] so that points that are further than uj^e 
apart have basically zero affinity. Specifically, we use the following bounds: 

0(l)l{|s|<l} < (t'is) < l{|s|<Ljjv} + H^^n)- 

The results are identical, except that statements of the form 5 — 2t > Z 
are replaced with d — 2t > LOf,Z . 

• Measure of flatness. As pointed out in the introduction, any reasonable 
measure of linear approximation could be used instead. Our choice was 
driven by convenience and simplicity. 

Appendix A: Preliminaries 

We gather here some preliminary results. Recall that, for a, e M, a V := 
max(a, 6); a A b :— niin(a, &); a+ — a V 0. For («„), (few) € , -< means 
a„ — 0{b,^)\ a„ X 6jv means both = 0(6jv) and b^ — 0{at^)] <^ 6jv 
means Ojv — o(b,^). For L G Ad, Pl denotes the orthogonal projection onto L. 
The canonical basis of MP is denoted ei, . . . , e£). 
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A.l. Large Deviations Bounds 



The following result is a simple consequence of Hoeffding's or Bernstein's in- 
equalities. 

Lemma 1 ([50], Lem. 5.3.7). Let {Xi)i>i be independent random variables in 
[0,1]. 

IfAa<J:^K{X,), 



Ifa>8j:^E{X,), 



(^^^Xi < < exp(— a). 
^^Xi > < exp(— a). 



A. 2. Some Geometrical Results 



We start by quantifying how well a surface S £ is locally approximated 

by its tangent. Recall that, for an afSne subspace T, Pt denotes the orthogonal 
projection onto T. For any s G 5, let Tg denote the tangent of S at s. 

Lemma 2. For any S G and s G 5, the orthogonal projection onto Tg is 

injective on B{s, 1/(4k)) H S and P^^ has Lipschitz constant bounded by \/2 on 
its image, which contains B(s, 1/(8k)) H Tg. Moreover, 

B{s,e)nS C B{Ts,Ke^), Ve, 

and 

S(s,e) nTg C B(S',2Ke2), Ve < 1/(8k). 

Proof. This sort of result is standard in differential geometry. We follow the 
exposition in [43]. We note that the manifold parameter t in [43], i.e., the 
inverse of the condition number, coincides with the manifold's reach. We thus 
fix here an 5 G and denote r := reach(S'). Since 1/k is a lower bound on 

the reach for manifolds in we have the inequality r > 1/k. 

Fix also a point s G S*. Applying [43, Lem. 5.4], we obtain that Pj; is one- 
to-one on B{s,e) S for any e < t/2, in particular, e < 1/(2k). We obtain an 
estimate on the image of Pt^ as follows. We note that [43, proof of Lem. 5.3] 
implies that 

PT,{B{s,e)nS) ^ B(s,ecosarcsin(e/(2T))) nTg. (24) 

Furthermore, 

if e < 1/(4k), cosarcsin(e/(2T)) > cos arcsin(Ke/2) > y/ 63/64 > 1/2. (25) 
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Combining (24) and (25), we conclude that 

Pt, (S(s, e) n S") D B(s,e/2) nTs, Ve < 1/(4k). (26) 

In particular, for e = 1/(4k), we obtain that the range of Pt^ (when applied to 
B{s, 1/(4k)) n S) contains the ball B(s, 1/(8k)) n Ts- 

Next, for any s' € B{s, 1/(4k)) n S, the derivative of the linear operator Pt^ 
in the direction u, a unit vector in Tg' i is 

Vu(PtJ = (PtJ • u = cos0i(r„span{u}) > cos0i{T,,T,,), (27) 

where 9i denotes the largest principal angle between the corresponding sub- 
spaces. In order to further bound from below the RHS of (27), we couple [43, 
Props. 6.2, 6.3] and use r > 1/k to obtain that 

cos 6*1 (Ts , > v/1-2k||s-s'||. (28) 

Combining (27) and (28) we conclude that Pj7^ has Lipschitz constant bounded 
by V2 in B(s, 1/(4k)) n T^. 

For the inclusions, we use the fact that 

||Pt.(x)-x|| <(«;/2).||s-x||2, Vx,se5, (29) 

which appears in [20, Th. 4.18(2)]. This immediately implies the first inclusion — 
which actually holds for any e > and with k replaced by k/2. The second 
inclusion follows by combining (26) with (29). □ 

Next, we estimate the volume of the intersection of the neighborhood of a 
surface and a ball centered at a point within that neighborhood. 

Lemma 3 ([3], Lem. 1). For S satisfying (6), x € B{S,t) and e,T > 0, 

voId{B{S, t) n P(x, e)) X e'^(e A r)^-^ volz5(B(5, r)) x r^-f 

The following result is on the approximation of a set of points in the neigh- 
borhood of a d-dimensional affine subspace by a d-dimensional affine subspace 
generated by a subset of d -I- 1 points. 

Lemma 4. There is a constant C > depending only on d such that, if 
Zi, . . . , z„i G B{L, rf), with L G Ad and m > d + 2, then there exists H £ Ad 
generated by d + 1 points among Zi, . . . , z„i, such that Zi, . . . , z,„ G B{H, Crj). 

Proof. For points ai, . . . , a^,, let aspanjai, . . . , a^} denote the affine subspace of 
minimum dimension passing through ai, . . . , a^. Let (ii, 12) € argmaxi^^ ||zi— Zj|| 
and, for d > A: > 3, 

ik G arg max dist(zj, aspan{zi^, . . . , Zi^ J). 
t^ii,...,ik-i 

Let Ak = aspanjzij, . . . jZ^j.^^}, for d > k > 1. Define Ai = \\zi^ — z.^Jj and, for 
d > k > 2, Xk = dist(zj^^j,span{zjj, . . . ,Zi^}). Also, let vi = (zj^-z^J/Ai and. 
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for fc > 2, Vfe = i^'if.^i ^ PAk-i'^ik+i) I Xk- Without loss of generality, assume that 
Zj^ is the origin, which allows us to identify a point z with the vector z— z^^ . Take 

z e {zi, . . . , Zm} and express it as z = aivi H h a^Vti + w, with w _L Ad- We 

show that ||w|| < C-q for a constant C depending only on d, which implies that 
z G B{Ad, Crj). Let Ci > 0, to be made sufficiently large later. By construction, 
Ai C ■■■ ^ Ad and Ai > • • • > with jjP^^ ^z|| < Afc for all fc = 1, . . . , d. 
Consequently, if A^ < Cirj, then |jw|j < IjP^i z|| < A^ < Cijy and we are done. 
Therefore, assume that A^ > Cirj. Define Pl'^u- We have 



Afc ''-I ^ Afc + Afc Ci 



Hence, for Ci large enough, qi, . . . , are linearly independent, and therefore 
span L. Suppose this is the case and define matrices V with columns vi, . . . , 
and Q with columns qi, . . . , q^. Then, by continuity, for Ci large enough we 
have 

\\Pl - PaA = ||Q(Q^Q)-'Q^ - VV^II < 1/2, 

where || • || here denotes the (Euclidean) operator norm. When Ci is that large, 
we have 

\\PlH = \\{Pl - PaJwII < ^||w|| < i (||Plw|| + ||Pi^w||) , 

so that ||Plw|| < ||P^_lw||. Now, using the triangle inequality, 

II-Pl^w|| < ||Pi^z|| + |ai|||Pi^vi|| + • • • + \ad\\\PL^Vd\\. (30) 

Because z G B{L,r]), we have ||Piiz|| < ij. For the other terms, we have 
ll-FL-LVfcll < jy/Afc as before, and, using the fact that, by construction, the 
Vi,...,Vd are orthonormal with Ak = span{vi, . . . , Vfc} and \\Pa^ ^^11 — ^k, 
together with the Cauchy-Schwartz inequality, we also have 



vIp^ 



< Xk 



Iflfcl = |VfcZ| 

Hence, the RHS in (30) is bounded by {d + l)rj, implying 

||w|| < ||Plw|| + ||Pl^w|| < 2\\Pl±w\\ < 2{d + 1)?]. 

We then let C = max(Ci ,2d+2). □ 

Below we provide an upper bound on the volume of the three-way intersection 
of the neighborhood of a surface, a ball centered at a point on the surface and the 
neighborhood of an affine d-dimensional subspace passing through that point, in 
terms of the angle between this subspace and the tangent to the surface at that 
same point. The principal angles between linear subspaces L,L' G Ad, denoted 

by 

^>9i{L,L') >--->ed{L,L')>0, 
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are recursively defined as follows: 



CQs9r{L,L') = mill min u u' = uj,, 



subject to 



l|u|| = ||u'||=l; 
u^u, = 0, Vs = 1, . . . ,r - 1; 
u'^u', = 0, Vs = 1,.. .,r- 1. 

Note that the orthogonality constraints are void when r = 1. (Some authors use 
the reverse ordering, e.g, [26].) 

Lemma 5. Consider a surface S £ S^{k). Suppose e > rjV t, rj > and r > 0. 
Let be the uniform distribution on B{S,t). For s G S, let Tg be the tangent 
space to S at s. Then for L G Ad containing s, 



d 



^{B{s, e) n B{L, rj)) -< e^{l A JJ 1 



A 



rjM T 



Proof. Fix s e S* and L £ Ad containing s, and let T :— Tg and 9j := 9j{L,T) 
for short. By definition, 

»(B(s,.)nB(L.„))= -'"°'^'^'-'"g';'""^'^-'"' . 

yo\d{B{S,t)) 

By Lemma 3, it suffices to show that 

yo\DiB{S, t) n B{L, r?) n i?(s, e)) -< e\ii A rf-" X{U^'^ 

We divide the proof into two cases; though the proof is similar for both, the 
first case is simpler and allows us to introduce the main ideas with ease before 
generalizing to the second case. 

Case < T. We use Lemma 2 and the fact that t > e^, to get 

B{S, r) n B{s, e) C B{T, (1 + k)t) n B{s, e). (31) 

Ignoring the constant factor 1 + k, we bound 

vo\d{B{T, t) n B{L, ri) n S(s, e)). 

We may assume without loss of generality that s is the origin and 

T = spanjei, . . . , ed], and 

L = span{ (cos 6*1)61 + (sin 6*1)6^+1, . . . , (cos6'<j)ed + (sin 6'rf)e2d}. 
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Then 



B{T,t) 












= {(^1,- 


■,zd) 


: ''^^{zj sin 9j — zj^j^j cos 9j) 

]<d 


3>2d 


Bis,e) 


- {(^1,- 


■,zd) 







j 



Take j < d; since l^d+jl < t, we have 

\zj sin 9j — Zd+j cos 0j \ < rj ^ \zj\ < 2{ri \/ t) / sin 6j < n{ri V r)/ 9j . 
Therefore, 

d 

i?(T,T)ni?(L,r;)nB(s,e)cn 

i=i 

From that we obtain the desired bound. 

Case T < . The arguments here are a httle different and we simply bound 
vo\d{B{S,t) n B{s,ej). Assume that e < 1/(8k). Because Pt is contractile, we 
have 

PriS n B(s, e)) C T n B(s, e) = 5^(0, e), 
so that, by Lemma 2, 

5ns(s,e) cPj:Hs<i(0,e)), 

where ^ : 5^(0, e) -> S" n B{s, e). Hence, 

B(5,r)nP(s,e) C{(a,b) :aePrf(0,e),||b-Pj;i(a)|| < r}. 

And by direct integration, the set on the RHS has D-volume of order f^r^-'^ 
since Pj7^ is Lipschitz on _Bj;(0,e) by Lemma 2. □ 

A companion of the previous result, the following lemma provides a lower 
bound on the angle between the affine subspace and the tangent. 

Lemma 6. Let e,r/ > 0, and take S G S^{k). Suppose L £ Ad is such that 
B{L, T]) contains s g 5 and y G B(s, e). Let Tg the tangent to S at s. Then 

e + r/ 

Proof. Let T denote Tg for short, and let L' be the line passing through s and 
PL(y). Since L' C L, we have 9i{L,T) > 9i{L',T), and using the triangle 
inequality and the fact that 9 > sm9, for 9 >0, this is bounded below by 

dist(PL(y),r) ^ dist(y,T)-7? 
dist(PL(y),s) - dist(s,y) + 77 ' 



-e A -,eA - 



XPD_d(0,7?AT). 
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The denominator does not exceed e + 77. For the numerator, 

dist(y,T) - WPriy) - y|| > dist(y, S) - dist(FT(y), 5)- 

Since ||y-s|| < e, we have Priy) £ TC^B{s,e), so that dist(PT(y), 5) < 2Ke'^ by 
Lemma 2. Consequently, the numerator is bounded from below by dist(y, S) — 
ne^ — 77. □ 

Next is another result estimating some volume intersections. It is similar to 
Lemma 5, though the conditions are different. 

Lemma 7. Consider a surface S € iSJ(k). Let 5* be the uniform distribution 
on B(S, t) . Then for e > 77 and r > 0, 

sup -^{Biy, e) D B{L, i^)) ^ e%l h 

where the supremum is over y G and L G Ad, and, the implicit constants 
depend only on k, d. Also, for e > lOrj, rj > lOne^ and t > 0, and any x G 
5(5, r), 

sup *(B(x, e) n B{L, 77)) y e'\l A (77/r))^-''. 

L 

Proof. The proof is similar to that of Lemma 5. We divide the proof into two 
parts. 

Upper bound. Let x G B{S, r) n B{y, e) n B{L, rj). When 77 > t, we use 

B{S, t) n B{y, e) n B{L, 77) C B{S, r) n B(x, 2e), 

while, when rj < t, we use 

B{S, t) n B{y, e) H B{L, C B{L, 7;) n B(x, 2e). 

In both cases, we conclude with Lemma 3. 

Lower hound. Let s be the point on S closest to x, with tangent subspace T. 
When 77 > 2t + 4Ke^ , take as L the translate of T passing through x and use 
Lemma 2 to get 

B{S, t) n B(x, e) C B{T, t + k{t + ef) C B{L, 77), 

and therefore 

B{S, t) n B(x, e) n B{L, 77) D B{S, r) n B{x, e). 

We then use Lemma 3. Now, suppose 7/ < 2t + Ane'^ and notice that, since 
77 > IOke^, we have r > 3k£^. First, assume that e > lOr. We use Lemma 2 to 
get 

B{S, r) n B(x, e) D B{T, t - 2Ke^) n B{s, e) n B(x, e), 
and therefore, 

B{S, t) n B(x, e) n B(L, 77) D B(T, r - 2Ke2) n B{L, ri) n B(s, e) n B(x, e). (32) 
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Without loss of generality, assume that x is the origin, L = spanjei, . . . ,6^;}. 
Since the volume is least when ||x — s|| = r, assume that s = red+i (seen as 
a point in space). Define v = [rj + 2Ke^)/2 and note that v < rj A (2t) by the 
conditions on rj and t. Then 





D {(zi,.. 


■,zd) 


^ E -1 


+ {zd- 








j>d+i 




B{s,e) 


= {{zi,-. 


■,zd) 


^ E 


+ {zd- 












i?(x,e) 


= {{zi,-. 


■,zd) 


:E^I^ 





By the conditions imposed on e, 77, r, the RHS in (32) contains 

Brf(0,e/10) X [77/4,3r,/4] x Bi5-d-i(0, ?7/10). 

Therefore the result. Finally assume that r > e/10 and take L passing through 
X and 2, — {1 — A)x + As, where A — e/(2r). We have ||z — x|j < e/2 and 
||z — s|| < r — e/2, so that i?(z, e/2) C i?(S', r)ni?(x, e) by the triangle inequality. 
Hence, 

B{S, t) n B{L, 77) n B(x, e) D B(L, ry) n B(z, e/2). 

We then conclude with Lemma 3. □ 

Lemma 8. Let ^ he the uniform distribution on a measurable subset A C MP 
of positive D-volume. Then for e> rj, 

supvI/(i?(y,e)ni?(L,77)) 

where the supremum is over y G and L G Ad, and the implicit constant 
depends only on d and vo\d(A). 

Proof. The proof is parallel to (and simpler than) that of Lemma 7. We omit 
details. □ 

A. 3. A Perturbation Bound 

In the proof of Theorem 1, we follow the strategy outlined in [42] based on 
verifying the following conditions (where (A4) has been simplified). Let Ik = 
{i : Xi e Xk} and let denote the matrix with coefficients indexed hy i,j G Ik 
and defined as 

Wij = ^ ad(xi,Xj,Xi^, . . . ,Xj^_J, Di='^Wij. 
ii,...,i„,_2e-ffc je/fc 

Let Wij = ii i G Ik, j & le, with k ^ £. Those are the coefficients of W and 
D under infinite separation, i.e., assuming S = 00. (In fact ^ > e + 2r is enough 
since we use the simple kernel.) 
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(Al) For all k, the second largest eigenvalue of is bounded above by 1 — 7. 
(A2) For all kj, with fc 7^ 

^ ^ on 

(A3) For all k and all i e h, 

(A4) For all k and all i,jeIk,Di< QDj- 

The following result is a slightly modified version of [42, Th. 2], stated and 
proved in [3, Th. 7]. See also [11, Th. 4.5]. Recall the matrix V defined in 
Algorithm 1. 

Theorem 2. Let vi, . . . , Vjv denote the row vectors of \ . Under (Al)-(A4), 
there is an orthonormal set {ri, . . . , r^^ } C such that, 

^ E E - ^^11' ^ ^Ql-'{K'., + Kvl). 
fc=i ie/fc 

Appendix B: Main Proofs 

For a set A, its cardinality is denoted by #A. Throughout the paper, C denotes 
a generic constant that does not depend on the sample size N and satisfies 
C> 1. 

B.l. Proof of Theorem 1 

Given Theorem 2, we turn to proving that the four conditions (A1)-(A4) hold 
with probability tending to one with vi = V2 = (Pn/C)""*^^; 7 > C~"^N~'^ and 
Q < C™ for some constant C > 0. Since TOlog(/9jv/C) ^ logiV, this implies 

max min ||vi — rfcjl — )■ 0. 

i=l,...,N k=l,...,K 

Therefore, since the r^'s are themselves orthonormal, the ii'-means algorithm 
with near-orthogonal initialization outputs the perfect clustering. 

We restrict ourselves to the case where r < (p^ \og{N)/Ny/'^ , for otherwise 
r] > e and HOSC is essentially SC, studied in [3]. With that bound on r, (12) 
reduces to e > {pj, \og{N) / NY ^ . By the same token, we assume that 77 < e, so 
that e > 77 > T + PjvC^- 

To verify conditions (A2), (A3) and (A4) we need to estimate the degree of 
each vertex under infinite separation and the edge weights under finite separa- 
tion. We start with the case of infinite separation. 
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Proposition 7. With probability at least 1 — N^P^/i^O ^ 

l{||x,-x,||<e/2}A^fce'' ^ Wl/^"'-^^ -< l{||x,-x,||<e}A^fce''; (33) 

and also, 

^i/(— 1) ^ ^^^d^ (34) 
uniformly over i, j G /j, and k ~ 1, . . . , K . 

Proof. Within a cluster, the hnear approximation factor in (3) is a function of 
the proximity factor. This is due to Lemma 2. Formally, let Gi^e denote the 
degree of in the neighborhood graph built by SC, i.e. 

= #{.?■ elk,j ^i: G S(x„ e)}, 

Then Proposition 7 is a direct consequence of Lemma 9, which relates G^^e to 
Wij and Di, and of Proposition 8, which estimates Gi^e- D 

Lemma 9. We have 
and, 

G,,,/2(G../2 - 1)^"-'^ < A < G,AG^,e " l)^"-^}, 

where r^™^ = r(r — 1) • • • (r — to + 1). 

Note that r{™> < r", and r{"> > (r/3)™ for r > m. 

Proof. We focus on the first expression, as the second expression is obtained by 
summing the first one over j & Ik, j i, where k is such that i Cz Ik- Therefore, 
fix i,j G Ik. The upper bound on Wij comes from the fact that 

diam(xi,Xj,xi, . . . ,x„_2) < e xi, . . . , x„j_2 e S(x,;, e). 

The lower bound comes from 

xi,...,x,„_2 e B(xi,e/2) diam(x,j,xi, . . . ,x,„_2) < e, 

and the fact that, 

xi, . . . ,x„i_2 e -B(5fc,T) n S(xi,e/2) ^ xi, . . . ,x,„_2 e ^(Ts^,?;), 

where is the point on Sk closest to x^. Indeed, take x e B{Sk,T) n B{xi, e/2) 
and let s G Sk such that ||x — s|| < t. By the triangle inequality, ||s — Si|| < e/2 + 
2t, so that, by Lemma 2, s e B{Ts^, K{e/2 + 2T)'^). Therefore, x G B(Ts.,K{e/2 + 
2t)^ + r). We then conclude with the fact that r < e and 77 > t + p^c^, with 

PjV — > 00. □ 

Note that N < K(^Nk, which together with (12) implies 

Nke^{lA{e/r))''-''>pl/{KC}\ogN, Vfc = l,...,if. (35) 
The following bound on G^^e is slightly more general than needed at this point. 
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Proposition 8. Assume that (35) holds. Then with probability at least 1 — 

G,,, X Nke'\l A (e/r))^-^ (36) 
uniformly over i E Ik and k = 1, . . . ,K . 

Proof. This is done in tlie proof of [3, Eq. (A4)] and we repeat the arguments 
here for future reference. Let ^'^ denote the uniform distribution on B[Sk,T). 
By definition, for any (measurable) set A, 

niA) ^ (37) 

volD(B(Sk,T)) 

Since G^^e is the sum of independent Bernoulli random variables, by Lemma 1, 
it suffices to bound it in expectation. Using Lemma 3, we have 



E (g,,,) = iVfc*fc(B(x„ e)) X NkC^il A (e/r)) 
Applying Lemma 1 and (35), we then get 



D-d 



Gi, > 16 EG,: 



V P (G,,e < EG,,,/8) < Ar-aCPw/C-R-O), 



We then apply the union bound and use the fact that N ■ N 2(Piv/(^C)) < 
7V-Pn/(^C)^ since ^ oo. □ 

We now turn to bounding the size of the edge weights Wtj under finite sep- 
aration. We do so by comparing them with the edge weights under infinite 
separation. 

Proposition 9. With probability at least 1 — N^^^ , 

{W,, - W,/)^^^"'-^^ -< l{||,,_,^||<,}iVeVp«. (38) 
uniformly over i € 1^, j (z li and k,£ — 1, . . . , K . 

Proof, li k ~ I, Wij — Wij is the sum of Q;d(xi,Xj,Xij, . . . ,Xi^_2) '^^^^ (distinct) 
ii, . . . , im-2 that are not all in Ik- When k ^ £, Wij — and Wij is again the 
same sum except this time over all (distinct) ii, . . . , im-2- Both situations are 
similar and we focus on the latter. We assume that ||xi — Xj || < e, for otherwise 
the bound is trivially satisfied. Note that this implies that p„?7 < S — 2t < e. 
Define 

G»,. = #{jV*:Xj eS(x„e)}, (39) 
which is the equivalent of G^^e under finite separation, as well as 

H^,e,r,iL) = #{j ^ ^ : X, e S(x„ e) n B{L, r;)}, 

and 

Htj.c.r, = maxiJi,e^,,(LM), 
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where the maxhxiuni is over all M C {1, . . . , N}, of size \M\ — d + 1 such that 
Xj G B{Lm,ti)- Then Proposition 9 is a direct consequence of Lemma 10, which 
relates G^^e and H* ^ ^ ^ to Wij, and of Propositions 10 and 11, which bound Gi^^ 
and H*j^^ jj, respectively. □ 

Lemma 10. There is a constant C > such that 

W,, < (G.,, + ir+\Hl,.,,c,)^"'-'-'^. (40) 
Proof. By definition of the affinity (3) and the triangle inequality, we have 

Wij < l{3LgLd:x„gB(x.,e)n-B(L,t7),VngAfu{»j}}, 
M 

where the sum is over M C {1, . . . , N} such that \M\ = m — 2 and i, j ^ M. For 
a subset Af C {1, . . . , A^}, of size \M\ = (i+ 1, let Lm denote the affine subspace 
spanned by {x„, n G A/}. By Lemma 4, we may limit ourselves to subspaces L 
that are generated by d + 1 data points, obtaining 



(41) 

M 

X ^ l{x„eB(xi,e)nS(Lflf,Cj)),VneM'U{ij}}i 
A/' 

where M is any subset of {1, . . . , N} of size d + 1, and M' is any subset of 
{1, . . . , N}\{MU{i,j}) such that M'UMU{i,j} is of size m. Such an M is of size 
at most m — d—1 and does not contain i or j. For any Af , i?(xj, €)nB{LM, Crj) 
contains at most i?*^ ^ data points other than x^ and Xj, so that the second 
sum is bounded by (-ff*j e cr;)™ ''^^ independently of Af. Similarly, B{xi,e) 
contains at most G^ ^ + 1 points, so the first sum is bounded by (G^^e + 1)"^'+^. 
The result follows. □ 



Proposition 10. Assume that (35) holds. Then with probability at least 1 — 

G,,, -< Ne'^il A {e/ryf-", (42) 
uniformly over i = 1, . . . , A^. 
Proof. We have 

E(G,,,) = ^7V,*,(B(x„e)). 

e 

Now, by Lemma 3, for all £ such that dist(xi, Si) < e + t, 
vI',(S(x.,e))^e''(lA(e/r))^-'*. 

Hgiicg 

E(G,,,)^A^e'(lA(e/T))^-''. 
We then use Lemma 1 and (35). □ 
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Proposition 11. With probability at least 1 — N^P'^ , 

Kj,e.n ^ ' (43) 

Pn 

uniformly over i G I^., j G I( and k =/= i in {1, . . . , K}. 

Proof. For L G Ad, Hi,t,ri{L) is a sum of independent Bernoulli random vari- 
ables, with expectation 

E = Nt^t{B{^^,e) n B{L, r^)). 

Take i such that 5(5'^, r) n i3(x,;, e) n 5(2^, 77) 7^ 0, and let x be in that set and 
s be the point on Si closest to x. Then by the triangle inequality and the fact 
that e > 77 > T, 

B{Si, t) n B(x„ e) n ry) C B{Su r) n B(s, 3e) n B{U, 3f]), 
where Lg is the translate of L passing through s. Therefore, 

^eXB{j^„e) n B{L, ry)) < l{dist(x,.5,)<e+r} ' sup *f (B(s, 3e) n B{L,, 3?])). 

seSe 

Our focus is on L such that Xi,Xj G B{L,ri), which transfers as Xi,Xj G 
B{Ls, 3r/) by the triangle inequality. Since x^ and Xj belong to different clusters, 
for a given £, at least one of them does not belong to Xg. Hence, by Lemma 6 
and the fact that 6 ^ rj > t + ne^, 6i{L,Ts) >- 5/e uniformly over s G S'^ and 
(Remember that 9i{L,T) denotes the largest principal angle between L and 
T.) Together with Lemma 5, we thus get 

^e{B{s, 3e) n B(L„ 3r/)) < Ce''{ri/5). 

Hence, by the fact that S > p^rj, we have 

E{H,^,JL)) < CNe\i^/5) < CNe^/p,. 

With Lemma 1 and (12), we then get 

SUpP(i/i,e,r,(i) > leCTVeVPiv) < iV-^PiV. 
L 

Hence, by the union bound, 

P (i?*j- > 16CA^eVp„) < iV+i • iV-2p" (44) 
The right hand side is bounded by A^^P" eventually. □ 
We now turn to verifying (Al)-(A4). 

• Verifying (A4): (34) immediately implies (A4) with Q = C" for some 
constant C > 0. 
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• Verifying (A3): Take fc = 1, . . . , if. By (33), (34) and (38), 
and also, 

E -< {N/N,)e-\p,/C)-'^"^-'\ 

Since N/Nk < N, e >- N^^/'^ and mlog(pjv/C) > logA^, we may take 
1^2 = (P«/C)-'"/'. 

• Verifying (A2): Take k,i=l,...,K, with k =^ L Then by (33), (34) and 
(38), 

Since e ;^ N^^/'^ and TOlog(pjv/C) ^ log-ZV, we may take i^i = (/On/C)"™- 

• Verifying (Al): As suggested in [42], we approach this through a lower 
bound on the Cheeger constant. Let be the matrix obtained from W/^ 
following SC. That has eigenvalue 1 with multiplicity 1 results from the 
graph being fully connected [14]. The Cheeger constant of is defined 
as: 

hk — mm , 

where the minimum is over all subsets / C /fc of size [/[ < Nk/2. The 
spectral gap of Zk is then at least /i^/2. By (33)-(34), there is a constant 
C > such that, 

d-,-1 ■ l{||x,-Xj||<e/2} 



hk>C-"\Nke'')-' min 



|7|<Ar,/2 [/[ 

From here, the proof is identical to that of [3, Eq. (Al)], which bounds 
the minimum from below by 1/Nk, so that hk > C^''^Ni^^. 



B. 2. Proof of Proposition 6 

From the proof of Theorem 1, it suffices to verify that (A2) and (A3) still hold 
under the conditions of Proposition 6, and in view of (19), we may focus on 14^^ 
for i G Ik and j £ Ii, with k ^ I, such that [jx^ — Xj|| < e and with Xj close to 
an intersection, specifically, for some p ^ ^, 

dist(xj,S'£ n Sp) < V, where v :— (sin6'int)~"'^(e A PnVi)- 
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In fact, we show that, under the conditions of Proposition 6, with probabihty 
at least 1 — 7jv, there is no such a pair of points (xi,Xj). For fixed {k,£,p), the 
probabihty that x^ ^ and Xj ^ 5'^ satisfy these conditions is 

E (vl/fe(S(x„ e))l{,^gB(5,n5,,.)}) , (45) 
after integrating over x;. By Lemma 3, 

where the imphcit constant depends only on K,d. Moreover, by condition (18), 

^eiBiSenSp,iy)) -< 

Therefore, using the union bound, the probability that there is such a pair of 
points is of order not exceeding 

NkNe ■ e'^zy'^-*"' = ]y2^d^d-d,^, = ^ Q. 

k,e 



B.3. Proof of Propositions 4 and 5 

Without loss of generality, we assume that 6o is small and that rj < e/10. Let 
^0 be the uniform distribution on (0,1)^ \ [Jf. B{Sk,SQ). By Lemma 3, this 
set has D-volume of order 1 - 0{K5^-'^), with KS^^"^ small since K is fixed. 
Therefore, for A C (0, 1)^, 

•^o{A)-yo\d (^A\[jB{Sk.5o)^ . 

Let /q C {!,..., iV} index the outliers and let A^o be the number of outliers. 

In view of how the procedures (01) and (02) work, we need to bound the 
degrees of non-outliers from below and the degrees of outliers from above. The 
following lower bound holds 

Nue''{l A {v/t))''-^ > {p./{KC)) log iV, Vfc = 1, . . . , if . (46) 

For (01), it comes from (11)-(12) and the fact that, for ah k ^ 0, Nk > 
N/{KCpj.), since N < K^Nk + iVo, implying Nk > {N - No)/{KC), and 
N — Nq > N / in our assumptions. For (02), it comes from (15) and (16) 
(and the inequality holds with p^^ in place of p^^ / {KC)). In the same vein, 

7Vfe(lA(ry/r))^-'^» A^7/^-^ Vfc = l,...,if. (47) 

We prove a result that is more general than what we need now. 
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Proposition 12. Assume (46) and (47). Then with probability at least 1 — 

JSf-PN/iKO^ 

N,e'{l A iv/r))^-' -< d]'^"^-'^ < N^^e^l A {i^/t))^°-''^^^~^^\ (48) 
uniformly over i ^ Ik, k ^ 0; and also, 

Oy^-'^ ^ (A^-A^o).''(lA(,/r))(^-'^)(i-^)ei-^l{,„<,+.j 

+iVe'^7y(^-^)(i-^), (49) 

uniformly over i G Iq, where ^ = 1 if t > e, and ^ = 1 A (rj/So), otherwise. 
Proof. Define 

Hi^e.Tj = max Hi ^_,^{L). 

Proposition 12 is a direct consequence of Lemma 11 which relates Di to Gi^^ 
(defined in Section B.l) and Hi ,:^ri, and of Propositions 13 and 14 (together with 
(47)), which bound G^.c and Hi^^ rj, respectively. □ 

Lemma 11. There is a constant C > such that 

HlXn ^ A < Gl^t'^HHU^^"'-'-'^ ■ (50) 

Proof. We get the upper bound by following the arguments in the proof of (38). 
For the lower bound, we simply have 

A > ^ l{3LeLd:XjGB(xi,e/2)nB(L,j;),VieM} 

J\/:|J\/|=m-l 
.{m-1} 



— i,e/2,rj 



□ 



The bounds for Gi,e and Hi^^^^i that follow are more general than needed at 
this point. In particular, the case of large r will only be useful in Section C. 

Proposition 13. Assume (46) holds withe in place ofrj. Then with probability 
at least 1 - Ar-p«/(^C)^ 

Nkc'il A (e/r))^--^ ^ G,^, ^ N^e^il A (e/r))^-^ + Noe^ . (51) 

uniformly over i (z Ik and k = 1, . . . , K . Also, 

G,,, ^{N- No)e\l A (e/r))^-'^l{,„<,+,} + TVe^. (52) 

uniformly over i (z Iq 

Proof. The proof is similar to that of Proposition 8. We bound G^^e in expecta- 
tion. Suppose i e /fe with k ^ 0. Then by Lemma 3 



E (G,,,) > 7Vfe*fcB(x„ e) X Nke^'il A (e/r)) 



D-d 
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For the upper bound, by Lemma 3 and the simple bound 
we have 

E (G,,,) = Ni^i{B{^,,e)) <{N- No)t\l A (e/r))^-'^ + TVoe^, 

f. 

with N — Nq < {K()Nk for any fc 7^ 0. As in as in Proposition 8, we then 
use Lemma 1 together with (46) and the union bound, to conclude the proof 
of (51). The proof of (52) is identical, except that, when (5o > r + e, we have 
^e{B(xi,e)) = Oife^O smdie Iq. □ 

Proposition 14. // (46) holds, then with probability at least 1 — N~P" /'^^'^\ 

(53) 

uniformly over i (z Ik and k =/= 0; and also, 

^ - ^o)e'(l A {v/r))''-'^liSo<.+r} + ^eV^'- (54) 
uniformly over i (z Iq 

Proof. First assume that i E with k ^ 0. For the lower bound in (53), let L 
be a subspace such that 

^^(^(x,, 6) n B{L, r;)) >- 6^(1 A (///r))^"'', 

which exists by the lower bound in Lemma 7. We have Hi^^^jj > iJ^^e ,,(L), and 
the term on the right hand side is a sum of independent Bernoulli random 
variables with expectation 

E (i/.,e.„(i)) - Nk^k{B{^,, e) n B{L, 77)) >- Nke^'il A (r^/r))^-"*. 

We then apply Lemma 1, using (46), and the union bound. For the upper bound 
in (53), the arguments are the same as in the proof of (43), except for the 
following bound in expectation, valid for any L g Ad, 

E(i/,,,^„(L)) = ^iV,*,(B(x„e)nB(L,77)) 

e 

^ (7V-iVo)e'(lA(77/T))^-'^ + 7VoeV"', 

by Lemmas 7 and 8. 

Now, assume that i E Iq. Again, the arguments are the same as in the proof 
of (43), except that the bounds in expectation are different. Specifically, if Sq > 
e + T, then 'i>e{B{xi,e) n B{L,rf)) — V£ 7^ 0, so that, by Lemma 8, for any 
L e Ad, 

E(if,,,,^(L))^7VoeV"''- 

Otherwise, 

E {H,,e,^{L)) ^{N- No)e^{l A (ry/T))^-'^^ + A^oeV"''- 

□ 
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We are now in a position to prove Propositions 4 and 5. We first consider 
(01). By (48) and (49), and the fact that r < 77 < we have 

maxi^y^™^') ^{N- iVo)e'^ -< {N/p^)e''. 

i 

On the one hand, by (48), d]'^'"'^'^^ >- Nke'^ >■ (iV/p„)e'*, uniformly over i e 
Ik, yk 0. Hence, since p„ — > 00, no non-outlier is identified as an outlier. On 
the other hand, by (49), for any i E Iq, 

since ^ -< ti/Sq -< p^^ and rj < e < p'^^^^^^'^K Hence, all outliers are identified 
as such. 

We now consider (02). On the one hand, by (48) and (16), and the expression 
for e and rj, we have 

jji/im-i) ^ ^ {rj/r jf-' >- pI log TV ^ p^A^eV"', 

uniformly over k ^ and i d Ik - Hence, no non-outlier is identified as an outlier. 
On the other hand, by (49), for any i G /q, 

^i/(™-i) ^ ^^d^D-d-^ ^ iVe'^,^^-^ 

which comes from m ^ log(iV)/ log(pjv). Hence, all outliers are identified as 
such. 

Appendix C: Proofs for the Estimation of Parameters 
C. 1 . Proof of Proposition 1 

Recalling the definition of G^^e in (39), we have 

Cor(e)-^G,,,. 

i 

Let = p'^^ and let ro be the integer defined by e^^+i < r < Ctq. Define 

rl := ((1 - d/D)ro + {d/D)r^)) A r„, 

and note that, for r < r^, (46) with e in place of rj is satisfied for e^- As there 
are only order logA^ such r's. Proposition 13 and the union bound imply that, 
with probability at least 1 - log(A^)A^-''"/(^^\ 

{N/p,fet{l A {er/r)f-^ -< Cor(e,) ^ iN/p,)hfil A (e./T))^-^ 

uniformly over r < r*^. Note that we used the fact that TV^e^ <C {N/ p,^)^ef{l A 
{er/T))^~'^, which holds since r, tq > 3. When this is the case, 

^ r 21ogiV-drlogp„ + 0(l), r<ro; 
{ 2logN-Dr\ogp^-iD-d)logT + Oil), r > n,. 
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In particular, for r <r*^, 

Ar - Ar+i ^ f rf + o(l), r<ro-l; 
logp„ \ D + o{l), r>ro + l. 

From the first part, we see that r > rg A (rjv — [2_D/(i]), since d < D — 1 
and ^ oo. To use the second part, note that + 2 < if, and only if, 
ro ^ Tn — l^D/d] . If this is the case, r < tq + 1. From this follows the statement 
in Proposition 1. 

C.2. Proof of Proposition 2 

We follow the proof of Proposition 1. We assume that d — d, which happens 
with probability tending to one. Let rjg — P^^^" and sq = tq — r. Define 

4 := {{2Dd + d- 2)/{D -d) + so) A (f - 1), 

and note that, for s < s^, (46) is satisfied for ef and 77^. Indeed, using the fact 
that ef > (log(iV)/7V)i/''p^^+i and r < p"*^", we get 

NkefilAirjs/r)f-^'- > (7V/(XC)p«)(log(A^)/A^)pL'''+'^'(l A pi^''-^^^^-'*^) 
= p„log(7V).p-^+'^^+^)^-'^-'')(^--'+, 

and the exponent in p„ is non-negative by the upper bound on s. As there are 
only order logiV such s's, Proposition 12 and the union bound imply that, with 
probability at least 1 - log(iV)7V~''"/(^^), 

Cor(e,,r;,) ^ {N/ p^)\-hf{l A {^Jr))^-" , 
Cor(e,,r;,) ^ {N/ p,)\f{l A {rjjT)f-'-^''+'^^^'''-'\ 

uniformly over s < s^. Note that we used the fact that 

Nhfv^-' « {N/p,)hf{lA{vs/r))^-'K 

When this is the case, 

= 21og7V-dflogp„+0(l), 

when s < Sq, and 

= 21ogiV - Dflogp„ + {-{D -d) + 0(l/m))(slogp„ + logr) + 0(1), 

when s > Sq. In particular, for s < s^, 

Bs - Bs+i ^ r 0(1), s < So - 1; 

logPw \D-d + o{l), s = so + l. 

From here the arguments are parallel to those used in Proposition 1. 
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